DSL Shared Task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation-Maximization and Chunk-based Language Model

Authors

  • Ondrej Herman
  • Vit Suchomel
  • Vít Baisa
  • Pavel Rychlý
Abstract

We investigate two approaches to the automatic discrimination of similar languages: an expectation-maximization algorithm for estimating the conditional probability P(word|language), and a series of byte-level language models. These methods reached accuracies of 86.6% and 88.3%, respectively, on set A of the DSL Shared Task 2016 competition.

Similar resources

Language and Dialect Discrimination Using Compression-Inspired Language Models

The DSL 2016 shared task continued previous evaluations from 2014 and 2015 that facilitated the study of automated language and dialect identification. This paper describes results for this year’s shared task and from several related experiments conducted at the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLTCOE). Previously the HLTCOE has explored the use of co...

Second Language System of Motivational Selves and English Language Skills: The Role of Socio-Economic Status among Iranian Language Learners

A. Parastaar Aski; M.H. Abdollaahi, Ph.D.; A. R. Moraadi, Ph.D.; H. R. Hassanaabaadi, Ph.D. Learning English in Iran is quite popular, yet not all are successful at this task. To shed light on this dilemma and its root causes, the ro...

Byte-based Language Identification with Deep Convolutional Networks

We report on our system for the shared task on discrimination of similar languages (DSL 2016). The system uses only byte representations in a deep residual network (ResNet). The system, named ResIdent, is trained only on the data released with the task (closed training). We obtain 84.88% accuracy on subtask A, 68.80% accuracy on subtask B1, and 69.80% accuracy on subtask B2. A large difference ...

Discrimination between Similar Languages, Varieties and Dialects using CNN- and LSTM-based Deep Neural Networks

In this paper, we describe a system (CGLI) for discriminating similar languages, varieties and dialects using convolutional neural networks (CNNs) and long short-term memory (LSTM) neural networks. We participated in the Arabic dialect identification sub-task of the DSL 2016 shared task, which involves distinguishing different Arabic language texts, under the closed submission track. Our proposed approach is l...

Comparison of Spectral Methods for Spoken Language Identification

Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...

Publication date: 2016